Information Retrieval and Text Categorization with Semantic Indexing

Authors

  • Paolo Rosso
  • Antonio Molina
  • Ferran Plà
  • Daniel Jiménez
  • Vicente Vidal
Abstract

In this paper, we present the effect of semantic indexing using WordNet senses on the Information Retrieval (IR) and Text Categorization (TC) tasks. The documents were sense-tagged using a Word Sense Disambiguation (WSD) system based on Specialized Hidden Markov Models (SHMMs). The preliminary results showed a small improvement in performance, and only in the TC task.

1 WSD with Specialized HMMs

We consider WSD to be a tagging problem. The tagging process can be formulated as a maximization problem using the Hidden Markov Model (HMM) formalism. Let $\mathcal{S}$ be the set of sense tags considered, and $\mathcal{W}$ the vocabulary of the application. Given an input sentence $W = w_1, \ldots, w_T$, where $w_i \in \mathcal{W}$, the tagging process consists of finding the sequence of senses ($S = s_1, \ldots, s_T$, where $s_i \in \mathcal{S}$) of maximum probability under the model, that is:

$$\hat{S} = \arg\max_{S} P(S \mid W) = \arg\max_{S} \left( \frac{P(S) \cdot P(W \mid S)}{P(W)} \right); \quad S \in \mathcal{S}^{T} \qquad (1)$$

Because the probability $P(W)$ is a constant that can be ignored in the maximization, the problem reduces to maximizing the numerator of equation 1. To solve this equation, the Markov assumptions are made in order to simplify the problem. For a first-order HMM, the problem is reduced to solving the following equation:
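
The equation itself is not part of this excerpt. As a minimal sketch, assuming the standard first-order HMM factorization into transition probabilities P(s_i | s_(i-1)) and emission probabilities P(w_i | s_i), the maximization over sense sequences can be carried out with the Viterbi algorithm. The sense labels, probability tables, and smoothing floor below are invented for illustration and are not taken from the paper; the Specialized HMMs it describes may differ in how states and emissions are defined.

```python
import math

def viterbi(words, senses, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable sense sequence for `words` under a first-order HMM.

    `floor` is a hypothetical stand-in for smoothing of unseen events.
    """
    if not words:
        return []

    def lp(table, key):
        # Log-probability lookup with a floor for missing entries.
        return math.log(table.get(key, floor))

    # best[i][s] = (log-prob of the best path ending in sense s at i, backpointer)
    best = [{s: (lp(start_p, s) + lp(emit_p, (s, words[0])), None) for s in senses}]
    for i in range(1, len(words)):
        column = {}
        for s in senses:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p, (p, s))) for p in senses),
                key=lambda pair: pair[1],
            )
            column[s] = (score + lp(emit_p, (s, words[i])), prev)
        best.append(column)

    # Follow backpointers from the best final sense.
    path = [max(senses, key=lambda s: best[-1][s][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path

# Toy model (all values are illustrative assumptions).
senses = ["river#1", "bank#shore", "bank#finance"]
start_p = {"river#1": 0.4, "bank#shore": 0.2, "bank#finance": 0.4}
trans_p = {
    ("river#1", "bank#shore"): 0.7,
    ("river#1", "bank#finance"): 0.2,
    ("river#1", "river#1"): 0.1,
}
emit_p = {
    ("river#1", "river"): 1.0,
    ("bank#shore", "bank"): 1.0,
    ("bank#finance", "bank"): 1.0,
}

tagged = viterbi(["river", "bank"], senses, start_p, trans_p, emit_p)
print(tagged)
```

Working in log space avoids numerical underflow on long sentences, and dropping P(W) is safe because it does not depend on the candidate sense sequence.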

Similar resources

Reflections on Image Indexing: A Picture Is Worth a Thousand Words

Purpose: This paper presents various image indexing techniques and discusses their advantages and limitations. Methodology: Through a review of the literature, it identifies three main image indexing techniques, namely concept-based image indexing, content-based image indexing, and folksonomy. It then describes each technique. Findings: Concept-based image indexing is te...

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: The current research aimed to compare the effectiveness of various tags and codes for retrieving images from Google. Design/methodology: Selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features were apportioned to each group of images separately. In this regard, each group image surr...

Two Hierarchical Text Categorization Approaches for BioASQ Semantic Indexing Challenge

This paper describes our participation in the BioASQ semantic indexing challenge with two hierarchical text categorization systems. Both systems originated from previous research in thesaurus topic assignment applied on small domains from the legal document management field. One of the described systems employs a classical top-down approach based on a collection of local classifiers. The other ...

Text Categorization and Information Retrieval Using WordNet Senses

In this paper we study the influence of semantics in the Text Categorization (TC) and Information Retrieval (IR) tasks. The K Nearest Neighbours (K-NN) method was used to perform the text categorization. The experimental results were obtained by taking into account, for each relevant term of a document, its corresponding WordNet synset. For the IR task, three techniques were investigated: the direct us...

Music Genre Classification Using Text Categorization Method

Automatic music genre classification is one of the most challenging problems in music information retrieval and the management of digital music databases. In this paper, we propose a new method to classify music genres using text categorization methods. Unlike previous solutions, which were mainly based on the analysis of acoustic or symbolic audio signals, here we consider music as a text-like se...

Classification and clustering methods for documents by probabilistic latent semantic indexing model

Based on information retrieval models, especially the probabilistic latent semantic indexing (PLSI) model, we discuss methods for the classification and clustering of a set of documents. A classification method is presented, and its good performance is demonstrated by applying it to a set of benchmark documents in free format (text only). Then the classification method is modified to a clustering metho...

Publication date: 2004